Systems and Methods for Enhancing Web-Based Searching

ABSTRACT

A system for enhancing web-based searching is provided. Categorizing and clustering techniques are used to optimize searching. Businesses are classified using a control group of predetermined categories. The predetermined categories may be SIC codes or headings that are used to describe business activities. The website address of a business listed in the control group is determined, and the content of the business's website is extracted. The extracted content is associated with the predetermined category that the business is classified under. The extracted content is used to further enhance the overall classification scheme. The system may compare and match the extracted content with the content of other businesses' websites that are similarly categorized. If a relevant keyword match is identified in several of the websites, the keyword may be used to update the classification scheme. A new category or sub-category can be created based on this keyword. Furthermore, when a search is performed, the search results are organized by these categories, and using various processes, the most common results are kept and the less relevant results are discarded.

RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 10/973,660, filed Oct. 26, 2004, which is a continuation-in-part of U.S. application Ser. No. 10/856,351, filed May 28, 2004, which claims the benefit of U.S. Provisional Application No. 60/474,559 filed on May 30, 2003, the entire teachings of which are incorporated herein by reference.

BACKGROUND

As the Internet has evolved over the years, there have been a myriad of ideas and schemes used to facilitate information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users who are inexperienced in the art of web research. Increasingly, information gathering and retrieval services are faced with a market full of users that want to be able to search for very specific information, as quickly as possible, and without being burdened with false positives.

Users are likely to navigate the web using human maintained indices, such as YAHOO! and the online yellow pages, or search engines such as GOOGLE. Human maintained indices cover popular topics effectively; however, they are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Such lists generally group information by predetermined categories. For instance, the online yellow pages organizes its listings by a standard industry code (SIC) scheme. YAHOO! also is based on a taxonomy structure but provides a class-generalization hierarchy of categories to support more sophisticated browsing.

Although human intelligence is used during the classification process for such indexed schemes, this classification process still suffers drawbacks. For example, the quality of web content classification is often skewed as a result of individual reviewer bias. Also, the growth of web content has made it virtually impossible to maintain an up-to-date database of classified web content. The predetermined categories that were once effective to classify information may become stale within a short period of time.

Instead of using index services, a user can retrieve information using search engines. Search engines, such as GOOGLE, allow a user to enter a query and will return a set of results based on the text from that query. When a query is initiated, the returned set of search results is displayed on one or more search web pages with the search result “hits” arranged in a ranked order. The methodologies that are used to select the hits and rank the search results from most relevant to least relevant vary from search engine to search engine. As a result, performing an identical query on two different search engines rarely, if ever, yields the same set of search results. Even if an identical set of search results is returned, the order in which the search result hits are presented will vary.

The methodologies used by the search engines to determine hits also typically yield search results that include irrelevant hits. For example, if a user looking for “house plans” initiates a search using a web-based search engine, the set of search results may include hits relating to “dog house plans”, “bird house plans”, and/or hits discussing “budget plans for the white house”. In some cases, the majority of the hits will be in the same category and relevant to the search query. Unfortunately, in other cases, very few of the hits will be in the same category and many of the hits will be irrelevant to the search query. This, of course, makes searching frustrating for users.

To address these issues, increasingly the trend is to incorporate a clustering algorithm that clusters search results by grouping certain hits together. Examples of search engines that perform hit clustering include Teoma and Fast. Using such automated clustering techniques is not surprising, given the large number of hits many search queries return, because reliance on people resources to classify billions of web pages into groups is impractical. Unfortunately, these automated computer-driven clustering technologies are rudimentary and prone to error, since no human intelligence is applied to assign context to the search query.

Consumers, for example, want to input minimal information as search criteria and in response, they want specific, targeted and relevant information. Being able to match a consumer's query to a proper business name is very valuable as it can drive a transaction (e.g., a sale). Accommodating these demands effectively unfortunately requires human intelligence, which is not easily captured into a search engine or index scheme without investing in an involved and expensive process. The difficulties of this process are compounded by the unique challenges that companies face to make their presence known to consumers on the internet.

Thus, one of the most complicated aspects of developing an information gathering and retrieval model is finding a scheme in which the cost benefit analysis accommodates all participants, e.g., the users, the businesses, and the search engine providers. At this time, the currently available schemes do not provide a user-friendly, provider-friendly and financially-effective solution to provide easy and quick access to specific information.

SUMMARY

In today's dynamic global environment, the critical nature of accuracy and efficiency in online information retrieval can mean the difference between success and failure for a new product or even a company. Users want specific information, and this information may be targeted to a specific product and to a business in a particular location that carries that product. In addition, users may want to know about other businesses that may also carry that same product or similar products. The current information gathering and retrieval schemes are unable to efficiently provide a user with such targeted information. Nor are they able to accommodate the versatile search requests that a user may have.

The invention relates to a scheme for optimizing information retrieval. An embodiment of the invention relates to a classification system, which is used to optimize Internet searching. Preferably, the system utilizes an existing collection of information that contains verified information about businesses that has been categorized using predefined classifications.

The yellow pages are one example of such a predefined classification that can be used. The yellow pages organizes its listings by subject, using the SIC codes and/or yellow page headings. Generally, businesses in the online yellow pages have an SIC heading or classification under which their listings are alphabetized. These SIC headings, however, follow archaic naming methods, which are often incongruous with typical user queries. For example, the heading for a ladies shoe store is shoes-retail, whereas a consumer is likely to type in “ladies shoes” or “ladies high heel shoes”. Brand names present another problem. For example, someone may type in “NIKE” when searching for “running shoes” and neither matches the yellow pages heading, “shoes-retail”. In order to compensate for this, yellow pages firms build a table of synonyms so that when someone types in, for example, “ladies high heel shoes” they can match this to a yellow page category. This process is very time consuming as the matching is done manually and there are hundreds of millions of phrases to match. There are some tricks involving root expanders, such as adding or subtracting characters (e.g., adding “s” to match “restaurant” to “restaurants”), but this only helps a fraction of the cases. Further, this manual processing is very expensive and labor intensive, as synonym tables quickly go stale as a result of the constant influx of new phrases incorporating new brands. Being able to match a consumer's query to a proper business name is very valuable as it can drive a transaction (e.g., a sale), and the business will pay for this service of being connected to the user.

Embodiments of the invention relate to an information gathering system that can continuously expand existing classification schemes into more comprehensive, versatile and efficient taxonomies. Existing predefined taxonomies, such as the yellow pages, which assign categories to entities or businesses, may be used as part of the foundation of the system. The website address of a business that is listed in the yellow pages may be determined. The content of the business's website may be extracted and associated with the yellow pages category of the business. The extracted content may be used to further enhance the classification scheme by defining a relationship between the extracted content and the yellow pages category of the business. For example, the extracted content may be compared with the extracted content of another website of a business. Content extracted from a plurality of websites may be analyzed to identify matching keywords or phrases. Machine learning and processing techniques may be used to identify matches in the extracted content. If a match identified is determined to be the best match, e.g., the highest ranking match, the match may be used to update the classification scheme. In this way, a new category or sub-category can be created based on this keyword or phrase. A sub-category, for example, may be created if a certain percentage of the matches are derived from websites of businesses that have been categorized under the same predefined classification, such as the same yellow pages business descriptor. For example, if the keyword match is a product name, such as NIKE, the system may use this product name to create a sub-category under shoes-retail. In this way, the system can continuously update its classification scheme to optimize searching. If a user queries the system for NIKE, the system can respond with all of the websites of businesses classified under this sub-category.

If a user queries the system for a term that the system had not previously categorized within the classification scheme, all websites containing potential matches may be grouped and identified and the system can produce search results based on the best match. For example, the system categorizes each business's website and businesses within the classification scheme. The content of each business's website is indexed in a table associated with the website. The queried term can then be processed through the indexed content of the websites. All hits can be filtered or tuned using processing techniques that enable the system to identify the best match. When the system determines that it has identified the correct match, the user's search query may be used to update the classification scheme by creating a new category or sub-category based on the query. The system can, therefore, use the user's query to update its classification scheme.

Businesses that do not have a website can be assigned to categories created in the classification scheme. If, for example, the business is associated with a yellow pages classification, such as shoes-retail, it can be associated with any sub-categories under the node shoes-retail, such as NIKE. By associating new classifications to businesses that do not have websites, businesses without websites can be linked to potential customers. This can be particularly useful when a user searches for a business in a particular geographical location that does not have any businesses with websites that relate to the query. For example, if a consumer queries the system for businesses in a particular location that carry NIKE, and the system does not identify any businesses in that location associated with NIKE that have websites, the system can link the query to businesses that may not have websites, but are associated with NIKE because they are under the shoes-retail classification.

The classification scheme may be defined as a hierarchy of relations. Business related categories defined by the yellow pages, for example, may be used in defining the hierarchy. The hierarchy may be a classification hierarchy. A plurality of relations corresponding to each category in the hierarchy may be defined. The relations may be indicative of an association between the category and website content associated with a business classified under that category. The relations may be indicative of an association between the category, the website content, and a new category, which has been defined based on the website content. The relationships in the hierarchy may be defined according to geographical location.

The website content of a business associated with a predetermined classification, such as an SIC classification, yellow pages, etc., may be used to create a control group. The control group may correspond to attributes about entities that are associated with predefined categories, which have been verified by an independent source. The control group may include attributes of the entities, such as the businesses' names, addresses, telephone numbers and website addresses. The control group may be used to search for other businesses that have not been categorized, and to assign categorizations to these businesses. Attributes in the content of an unclassified business's website can be used to further develop the classification scheme.

The categorized entities in the control group may be stored according to a classification hierarchy of relations. An unclassified business's website may be categorized in the hierarchy by comparing the content of the unclassified website with the website content in the control group. For example, the unclassified business's website may be assigned to a classification in the hierarchy by identifying matching content (e.g., identifying terms in the extracted content of the unclassified website that match content associated with a category in the classification hierarchy).

Techniques may be used to optimize clustering of search results. The classification hierarchy (classification scheme) may be used to cluster search results. The search results can be clustered based on predetermined criteria. For example, a user may submit a search query to the system. The system may extract website content associated with the search results. The search results may be grouped based on keywords identified in the extracted website content. The groups may correspond to the categories in the classification hierarchy. Groups that have hit counts below a specified level can be removed. The remaining groups of hits can be analyzed. The system may extract the website content associated with the search results and compare the content against the website content of entities categorized in the control group. The matches can be identified and counted. The search result may be assigned to the category that corresponds to the highest ranking match.

A web-based search engine may be provided that returns hits based on a text query. Filtering and organizing techniques may be used to discard hits returned during the query that are irrelevant to the context of the query. Such techniques may be used to group remaining hits into related categories. Search results may be filtered to remove irrelevant hits, and the remaining hits may be grouped into categories, resulting in groups of hits that are likely to be related to the context of the search query.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 is a schematic diagram that depicts the architecture of a system for classifying information according to an embodiment of the present invention.

FIG. 2 is a flow diagram that depicts a process for classifying websites according to an embodiment of FIG. 1.

FIG. 3 is a flow diagram that depicts the process performed during a web-based search using a full text indexed search engine in accordance with the present invention.

FIG. 4 is a diagram that depicts a graphical user interface of a mock-up full text indexed search engine incorporating clustering to group search results.

FIGS. 5A-B are diagrams that depict graphical user interfaces of full text search engines incorporating clustering to group search results.

DETAILED DESCRIPTION

A description of preferred embodiments of the invention follows.

System Architecture

Preferably, the invention is implemented in a software or hardware environment. One such environment is shown in FIG. 1. In this example, a system 10 is provided for classifying information. The system 10 includes a data extraction tool 30-4 that crawls the web 15. The data extraction tool 30-4 extracts full text (attributes) from websites. The extracted attributes may relate to information about a business, such as its products, activities, physical location, customers, services, etc. The business may or may not have a website on the Internet. The data extraction tool 30-4 may interface with existing search engines 20, such as GOOGLE or YAHOO!, to extract any indexed information for websites.

A content analyzer 40-4 is used to analyze and classify the extracted attributes, which are received from the data extraction tool 30-4. The content analyzer 40-4 categorizes the extracted attributes using various machine learning and matching techniques. While categorizing the extracted attributes, the content analyzer 40-4 interfaces with, among other things, a persistent storage that consists of a collection of information 25. Preferably, the information is organized in a hierarchical structure. The collection of information 25 may be a database that includes information categorized using an existing taxonomy, which has been obtained from a verified or independent source.

Existing taxonomies that assign categories to entities or businesses are used as part of the foundation of the system 10. One such example is the standard industry classification (SIC) scheme (see the North American Industry Classification System, which aims to provide a large public taxonomy for reuse). SIC codes are numerical values used to categorize and uniquely identify business activities. Each SIC code corresponds to a business descriptor. The yellow pages, for example, typically uses the business descriptors associated with SIC codes to classify content. Another example of a classification system is Open Directory (also known as DMOZ), which provides a user-generated dictionary for the web. Other potential sources of categorized information include the BETTER BUSINESS BUREAU membership list and the American Association of Retired Persons (AARP) membership list. It should be noted that any standardized classification scheme is compatible with the invention.

The business information listed in the yellow pages, membership lists, etc., is stored in the database 25. The website content of the businesses listed is extracted and associated with the businesses in the classification hierarchy in the database. The website content of the businesses, and the categories assigned to the businesses by the yellow pages, for example, are stored in the database 25 and used to establish a control group, which defines a classification scheme (classification hierarchy) for the system 10.

When crawling, categorizing, searching or clustering web content, the system 10 may use a variety of processing techniques 40. These processing techniques may include distiller 40-1, domain name analyzer 40-2, parsers 40-3, content analyzer 40-4, clustering logic 40-5, and filters 40-6. The domain name analyzer 40-2, for example, may be used to analyze domain names in URL addresses identified when crawling or searching the web 15. The parsers 40-3 and filters 40-6 may be used by the system 10 to target a user's search query to a specific context, category or subject. The distiller 40-1 may be used to eliminate false positives from search results. Preferably, the domain name analyzer 40-2, parsers 40-3, filters 40-6 and distiller 40-1 are implemented using the techniques described in U.S. application Ser. No. 10/772,784, filed Feb. 5, 2004, the entire teachings of which are incorporated herein by reference.

The database 25 preferably includes a classification hierarchy that defines a plurality of relations. The relations include yellow pages categories, businesses associated with those categories, geographical locations associated with those businesses, and website content of the businesses. For example, the content analyzer 40-4 interfaces with the data extraction tool 30-4 to extract website content of websites associated with businesses listed in the yellow pages. This website content is associated with the yellow pages category related to the business. This can be used to create a plurality of relations that associate the website content of a business with its yellow pages category in the classification hierarchy.
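The relations described above can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class names, field names, and sample businesses below are hypothetical and do not come from the specification, and an actual embodiment would presumably keep these relations in the database 25 rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Business:
    """A listed business, its location, and terms extracted from its website."""
    name: str
    location: str
    website_terms: set[str] = field(default_factory=set)  # empty if no website


@dataclass
class CategoryNode:
    """A node of the classification hierarchy, e.g. a yellow pages heading."""
    heading: str
    businesses: list[Business] = field(default_factory=list)
    subcategories: dict[str, "CategoryNode"] = field(default_factory=dict)

    def terms(self) -> set[str]:
        """Website content associated with this category through its businesses."""
        collected: set[str] = set()
        for business in self.businesses:
            collected |= business.website_terms
        return collected


# Hypothetical sample data.
shoes = CategoryNode("Shoes-Retail")
shoes.businesses.append(
    Business("Example Shoe Shop", "Boston, MA", {"nike", "running shoes"}))
shoes.businesses.append(
    Business("Sample Footwear", "Austin, TX", {"adidas", "running shoes"}))

print(sorted(shoes.terms()))
```

Each node ties a category to its businesses, each business carries its geographical location and website content, and the `subcategories` mapping leaves room for categories that are created later from that content.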

Algorithm

FIG. 2 shows a process for classifying websites according to an embodiment of FIG. 1. At step 210, the websites of businesses that are associated with known classifications, such as SIC codes, are identified and crawled. The URL address for each business's website may be determined using techniques described in U.S. application Ser. No. 10/620,170, filed Jul. 15, 2003, and U.S. application Ser. No. 10/772,784, filed Feb. 5, 2004, the entire teachings of which are incorporated herein by reference.

Content, such as full text, keywords, descriptions, titles, metadata, etc., is extracted from these classified websites using the data extraction tool 30-4, and this information is stored in an indexed database 25, as described at 215. The content of each business's website, for example, may be stored in a table index. It should be noted that the system may gather some or all of this content using existing Internet search engines, such as GOOGLE, and directory based Internet services, such as YAHOO!.

At 220, a business's website content is correlated with the business's SIC classification. This content is stored in the database 25 and associated with the business's yellow pages descriptor or SIC code. Any attributes, such as keywords extracted from the business's website, are associated with the business's yellow pages descriptor. Such keywords may include the brand names and product information discussed on the website. For example, if the business is categorized in the yellow pages under Retail Shoe Store, the system 10 may associate keywords in its content with the Retail Shoe Store descriptor. If, for instance, the webpage includes brand names, such as “NIKE” and “ADIDAS,” as well as descriptive key terms, such as “running shoes” or “ladies high heels”, this content may be extracted from the website and then associated with the Retail Shoe Store category.

As this process is repeated for each website of a business that is classified under a SIC category, each SIC category becomes more comprehensive and versatile. In addition, the key terms used to further develop each SIC category can be divided into sub-categories using ranking or pattern recognition algorithms discussed in more detail below. In this way, content extracted from websites of businesses can be used to build and enhance each yellow pages descriptor or SIC category. Computational techniques can also be used to increase the relevancy of the resulting index by filtering irrelevant key terms that are extracted from the sites, as will be discussed in more detail below.
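As a rough illustration of steps 215 and 220, the association between extracted key terms and a category can be sketched as a per-category term count. The function name `associate` and the in-memory `category_terms` table are assumptions made for this example; the specification only says that the content is stored in an indexed database.

```python
from collections import Counter, defaultdict

# category -> how often each extracted key term was seen on classified websites
category_terms: defaultdict[str, Counter] = defaultdict(Counter)


def associate(category: str, extracted_terms: list[str]) -> None:
    """Record terms extracted from one classified website under its category."""
    # Count each term once per website, ignoring case.
    category_terms[category].update({term.lower() for term in extracted_terms})


# Two hypothetical websites listed under the same yellow pages descriptor.
associate("Retail Shoe Store", ["NIKE", "ADIDAS", "running shoes"])
associate("Retail Shoe Store", ["NIKE", "ladies high heels"])

print(category_terms["Retail Shoe Store"].most_common(1))
```

Repeating `associate` for every classified website makes each category's term table richer, and the counts give later ranking or filtering steps something to work with.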

At 225, the crawler crawls the web and extracts content from websites that have not been classified. These “unclassified” websites are not known to be associated with a business that has been assigned an SIC category or yellow pages descriptor. Unclassified websites may also be located using a search engine. The keywords that were extracted from the classified websites and stored under yellow pages descriptors or SIC categories in the database 25 can be used as search criteria to search for unclassified websites. Any websites identified in the search results that have not been classified can be candidates for SIC classification.

Content extracted from unclassified websites is compared against the classified content at 230. At 235, the content of an unclassified website is matched with content associated with an SIC heading of classified websites in the indexed database. This can be performed using machine learning and pattern recognition techniques that identify relationships between the content extracted from an unclassified website and the key terms categorized under an SIC heading. The system 10 determines candidate categories for the unclassified websites at 240. Computational techniques can be used to cluster and rank the potential candidate categories under which the unclassified website may potentially be classified. At 245, the unclassified website is assigned to an SIC category in the database.
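Steps 230 through 245 can be sketched as follows. The specification leaves the matching technique open (machine learning and pattern recognition are mentioned), so this example substitutes a plain term-count overlap as a stand-in; the function name, the index contents, and the scoring rule are all assumptions for illustration.

```python
from collections import Counter


def candidate_categories(site_terms: set[str],
                         index: dict[str, Counter]) -> list[tuple[str, int]]:
    """Rank categories by how strongly their key terms match a website's terms."""
    scores: Counter = Counter()
    for category, terms in index.items():
        for term in site_terms:
            scores[category] += terms.get(term, 0)
    # Keep only categories with at least one match, best match first.
    return [(c, s) for c, s in scores.most_common() if s > 0]


# Hypothetical index built from classified websites.
index = {
    "Shoes-retail": Counter({"nike": 5, "running shoes": 3}),
    "Radio stations": Counter({"music": 4, "nike": 1}),
    "Credit Unions": Counter({"loans": 6}),
}

ranked = candidate_categories({"nike", "running shoes"}, index)
print(ranked)  # [('Shoes-retail', 8), ('Radio stations', 1)]
```

Under this sketch, the first entry of `ranked` would be the category assigned at 245, while the remaining entries are the lower-ranked candidates from 240.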

Preferably, this embodiment works similarly to the way clustering engines work now, but is unique in that it uses a predefined humanly classified control group (yellow page headings, SIC codes) to organize the clusters, yet also allows the control group to grow if a large group of results forms a cluster but this cluster does not fit into any specific predefined category. As an example, the term ISP is not classified in the SIC codes, which were last updated in 1984, before the Internet was very prevalent. However, the term ISP may occur in many sites and form a cluster, labeled ISP providers, that gets added to the control group as its own category. The clustering of ISP providers using keywords itself is not novel but is an improvement upon the main idea of using a control group to map clusters of results. Mapping clusters without a control group, as used by engines such as Vivisimo and Teoma, is very scalable but quite unwieldy and very difficult to interface with another fixed hierarchy system.

The content from the unclassified websites may also be used to supplement the key terms associated with an SIC category. Sub-categories can be created using the key terms under the SIC category. This may further enable each SIC category to become more comprehensive. Ranking algorithms may be used to filter irrelevant content. In this way, the system can avoid associating the SIC categories with irrelevant key terms extracted from the unclassified websites.

Increasing the Relevancy of Resulting Indices

By performing the process described in FIG. 2, an unclassified website can be classified. When all of the potential matches that an unclassified website may be classified under are counted, the relevant matches occur far more often than the mismatches. For example, if the unclassified website includes the key phrase “ladies high heel shoes”, and this phrase is compared against the indexed database, the following SIC matches may result:

-   Shoes-retail [23739]
-   Custom and orthopedic shoes [9567]
-   Leather goods [9382]
-   Bridal shops [7599]
-   Kids clothing [3453]
-   Apparel and garments-retail [1987]
-   Clothing-retail [1852]
-   Radio stations [745]
-   Shoe Manufacturers [546]
-   . . .
-   Credit Unions [234]
-   . . .
-   Toys-retail [36]
-   Shipping yards [7]

Total results [68459] total categories [137] average [500] median [110]

There are a number of techniques that can be used to determine the best match, among a number of potential matches, under which an unclassified website should be categorized (performed at 240 of FIG. 2). The following are some example techniques that could be used to determine the best match:

-   1. Only the highest match is accepted.
-   2. Only records above a certain percentage of the total are matched.
-   3. Only the records that are part of the top X percent are included.
    -   a. For example, if 50% [34230] then the first three would match [23739+9567+some of 9382].
-   4. Only the top record and records which were over X% of the top listing.
    -   a. For example, if 30% then all records with more than 7121 [23739×30%] would match, which is 3 more records in this example.
-   5. Only records that are above a multiple of the average.
    -   a. For example, if 10×, then only records above a count of 5000 would be included.
-   6. Only records in the same parent category as the main category.
-   7. Only records which contain one of the words from the yellow pages category.
-   8. Only records which contain one of the words from the main listing but common words in a defined list are excluded (e.g., retail, manufacturers, products, associations).
    -   a. Handles irrelevant cases from above example.
    -   b. Could have a minimum value requirement to exclude firms, such as ‘shoe manufacturers’.
-   9. Number of results could be fixed to 1, 2, 3, or other number.
-   10. All or portion of the results would be saved and depth specified by the consumer.

The above listed techniques are examples of approaches that may be used to determine the best match in the search results.
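To make the arithmetic in techniques 1, 3, 4 and 5 concrete, the sketch below applies them to the match counts listed for “ladies high heel shoes”. The function names are hypothetical, only the categories explicitly shown in the example are included (the elided entries are omitted), and the total of 68459 and average of 500 are taken from the figures stated above rather than recomputed.

```python
counts = {
    "Shoes-retail": 23739,
    "Custom and orthopedic shoes": 9567,
    "Leather goods": 9382,
    "Bridal shops": 7599,
    "Kids clothing": 3453,
    "Apparel and garments-retail": 1987,
    "Clothing-retail": 1852,
    "Radio stations": 745,
    "Shoe Manufacturers": 546,
    "Credit Unions": 234,
    "Toys-retail": 36,
    "Shipping yards": 7,
}
TOTAL_RESULTS = 68459   # stated total across all 137 categories
AVERAGE = 500           # stated average per category


def highest(counts):
    """Technique 1: accept only the highest match."""
    return [max(counts, key=counts.get)]


def top_share(counts, total, share):
    """Technique 3: walk down the ranking until `share` of the total is covered."""
    kept, running = [], 0
    for category, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= total * share:
            break
        kept.append(category)
        running += n
    return kept


def over_fraction_of_top(counts, fraction):
    """Technique 4: keep the top record and records above a fraction of it."""
    top = max(counts.values())
    return [c for c, n in counts.items() if n > top * fraction]


def above_multiple_of_average(counts, average, multiple):
    """Technique 5: keep records above a multiple of the average count."""
    return [c for c, n in counts.items() if n > average * multiple]


print(highest(counts))
print(top_share(counts, TOTAL_RESULTS, 0.50))
print(over_fraction_of_top(counts, 0.30))
print(above_multiple_of_average(counts, AVERAGE, 10))
```

With these inputs, technique 3 at 50% keeps the first three categories, while technique 4 at 30% and technique 5 at 10× both keep Shoes-retail plus the next three categories, which agrees with the worked figures in the list.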

Indexed Hierarchy

Traditionally, the online yellow pages enables a user to identify a business by specifying a location and a business name or its category. For example, the query “restaurant” would match “Fred's Family Restaurant” or “Bob's Diner”. “Fred's Family Restaurant” is a business name match and “Bob's Diner” matches the yellow pages heading restaurant. Generally, businesses in the online yellow pages are alphabetized under a common business descriptor, which corresponds to a SIC heading. Headings alone, however, are often not very useful in this type of matching because they follow archaic naming methods, which are often incongruous with typical user queries. For example, the yellow pages heading for a ladies shoe store is Shoes-Retail; whereas, a consumer is likely to type in “ladies shoes” or “ladies high heel shoes”.

Embodiments of the system described in FIGS. 1 and 2 use machine learning techniques to associate the yellow pages category, for example, shoes-retail, with keywords extracted from businesses' websites. In particular, all businesses under the parent node Shoes-Retail would be classified under that node in the indexed hierarchy 25 stored in the database. Common attributes extracted from those websites would be associated with a sub-category under the Shoes-Retail node. For example, if “high heels” was an attribute common to businesses listed under the Shoes-Retail node, then the system would learn that High Heels is a sub-category of Shoes-Retail.
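One simple way to picture how a common attribute becomes a sub-category is a share-of-websites threshold, in the spirit of the "certain percentage" criterion mentioned in the Summary. The sketch below is an assumption-laden stand-in for the machine learning techniques referred to above: the function name, the 50% default threshold, and the sample term sets are all hypothetical.

```python
from collections import Counter


def learn_subcategories(sites_terms: list[set[str]], min_share: float = 0.5) -> list[str]:
    """Return attributes found on at least `min_share` of a category's websites."""
    if not sites_terms:
        return []
    site_count = len(sites_terms)
    seen_on: Counter = Counter()
    for terms in sites_terms:
        seen_on.update(set(terms))  # each website counts once per attribute
    return sorted(t for t, k in seen_on.items() if k / site_count >= min_share)


# Hypothetical attributes extracted from four websites under Shoes-Retail.
shoes_retail_sites = [
    {"high heels", "nike", "free shipping"},
    {"high heels", "sandals"},
    {"high heels", "nike"},
    {"boots"},
]

print(learn_subcategories(shoes_retail_sites))  # ['high heels', 'nike']
```

Here “high heels” appears on three of four websites and “nike” on two of four, so both clear the assumed 50% threshold and would become sub-categories of Shoes-Retail, while one-off attributes are left out.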

Geographic Searching

The system 10 is able to optimize searching by using a geographical context in the search criteria. Because the yellow pages listings are organized based on geography and business category, the system 10 preferably associates each business and its attributes with its physical location. This becomes important, for example, when a user queries the system 10 to find a local restaurant or a business that sells a certain type of product. Consider the situation where the query is for a business that sells a “waxing” product. In this example, the system would be able to determine whether the user is looking for “leg waxing” or “surfboard waxing” products if the user specifies a particular geographical location. For instance, if Hawaii or California is specified, then the system 10 may conclude that the user is looking for “surfboard waxing” products, and thus return results responsive to this request. If the system 10 is unable to readily determine an appropriate search category for the query, the system 10 may present the user with the potential categories, which are responsive to the query.
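The “waxing” example can be sketched as a count of matching local listings per category, with a fallback that returns every candidate category when none clearly dominates. The specification does not say how the system decides that one category is appropriate, so the `dominance` threshold, the function name, and the listings below are purely illustrative assumptions.

```python
from collections import Counter

# Hypothetical listings: (location, category, terms associated with the business)
listings = [
    ("Hawaii", "Surfboard waxing", {"waxing", "surf wax"}),
    ("Hawaii", "Surfboard waxing", {"waxing", "board repair"}),
    ("Hawaii", "Leg waxing", {"waxing", "salon"}),
    ("Ohio", "Leg waxing", {"waxing", "salon"}),
]


def categories_for(query, location, listings, dominance=0.6):
    """Pick the dominant local category for a query, or list all candidates."""
    hits = Counter(category for loc, category, terms in listings
                   if loc == location and query in terms)
    total = sum(hits.values())
    if total == 0:
        return []
    top_category, top_count = hits.most_common(1)[0]
    if top_count / total >= dominance:
        return [top_category]       # one category clearly fits this location
    return sorted(hits)             # ambiguous: present the candidates to the user


print(categories_for("waxing", "Hawaii", listings))
print(categories_for("waxing", "Ohio", listings))
```

With this sample data, a query from Hawaii resolves to the surfboard category and a query from Ohio to the leg waxing category; raising `dominance` makes the sketch return both categories for Hawaii, which corresponds to presenting the user with the potential categories.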

Furthermore, the system 10 also categorizes businesses that do not have a virtual location (e.g., a website) but are listed in the yellow pages. These businesses are associated with attributes extracted from websites of businesses that are classified under the same yellow pages heading. For example, attributes extracted from a website that is classified under the yellow pages heading Toys Retail are associated with businesses that do not have websites but are also classified under Toys Retail. This addresses the situation where a user queries the system for a business that sells a product, such as a specific toy in a certain location, and although the system 10 is able to identify businesses classified under Toys Retail in that location, none of the businesses identified actually have websites. In this way, the system 10 is able to match a consumer's product query to a proper business name even if the business does not have a website, thus creating a connection between the user and the business that can potentially drive a sale for the business identified.
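
The attribute sharing described above can be sketched as follows. The sketch assumes each listing is a simple record with a name, heading, location, and an optional set of attributes extracted from its website; the record layout, function names, and sample businesses are hypothetical.

```python
def attributes_by_heading(businesses):
    """Pool the attributes extracted from websites under each yellow pages heading."""
    pooled = {}
    for business in businesses:
        extracted = business.get("website_attributes")
        if extracted:
            pooled.setdefault(business["heading"], set()).update(extracted)
    return pooled


def find_businesses(businesses, product, location):
    """Match a product query to local businesses, with or without websites.

    A business matches when the product is among the attributes pooled for
    its heading, even if the business itself contributed no attributes.
    """
    pooled = attributes_by_heading(businesses)
    return [
        business["name"]
        for business in businesses
        if business["location"] == location
        and product in pooled.get(business["heading"], set())
    ]


# Hypothetical listings: only the first business has a website.
businesses = [
    {"name": "Toy Emporium", "heading": "Toys Retail", "location": "Boston",
     "website_attributes": {"wooden trains", "puzzles"}},
    {"name": "Corner Toys", "heading": "Toys Retail", "location": "Springfield",
     "website_attributes": None},
    {"name": "Main Street Shoes", "heading": "Shoes-Retail", "location": "Springfield",
     "website_attributes": None},
]

print(find_businesses(businesses, "wooden trains", "Springfield"))
```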

Web-Based Clustering

Search results returned during a web-based query 30-1 are filtered 40-6 to discard irrelevant hits. The remaining relevant hits are then organized into groups by a clustering algorithm 40-5 according to a hierarchy of relations (“clustered”), which yields a relevant set of search results consistent with the meaning and intent of the search query. In one embodiment, the clustering methodology is used in conjunction with one or more full text-indexed search engines 20.

Referring to FIG. 3, a flowchart illustrates the process performed during a web-based search using a full text-indexed local search engine in accordance with the present invention. When a user wishes to initiate a search, a query is entered into the full text-indexed search engine at 310. Once the query has been initiated, the search engine performs a full text index search of available web pages at 320. The web pages that satisfy the query text are then returned, yielding a set of hits at 330. Hits returned in response to the query are then matched against a hierarchical criteria database at 340. In particular, content is extracted from the website hits and compared against the classification hierarchy stored in the hierarchical criteria database.

The hierarchical criteria database 25 may be created in accordance with the processes described in FIGS. 1 and 2. The database may include business information that has been obtained from an independent source, such as the yellow pages. The business information may be supplemented with extracted website content (fully indexed) associated with the businesses. This extracted website content may be assigned to the business category associated with the business in the classification hierarchy. For example, the database 25 includes a classification hierarchy that defines a plurality of relations. The relations include yellow pages categories, businesses associated with those categories, and geographical locations associated with those businesses. In clustering the search results in accordance with FIG. 3, the website content of the search results may be extracted and matched against the categories in the classification hierarchy, and against the content associated with the categories stored in the hierarchical database 25.

During this process, hits that do not match against the database are discarded. Hits that match against the database are organized into groups based on the hierarchy of the database according to set criteria. As a result, hits are grouped in a manner that is consistent with the intent and meaning of the search query. In one preferred implementation, the criteria database includes a classification hierarchy, which defines a plurality of relations associated with businesses and SIC codes, allowing the relevant hits to be grouped according to SIC code.

After the hits have been grouped, each group of hits is examined at 350 to determine whether the group hit count is above a threshold value. In this example, the threshold value is an integer number of hits. If no group includes a hit count above the threshold value, the criteria for database matching are adjusted at 360 and the organization of the hits into groups is re-performed at 350. During the criteria adjustment, the matching criteria are moved up the hierarchy of the database. The groups of hits are then re-examined to determine the groups that have hit counts above the threshold value. If still no group includes a hit count above the threshold value, the criteria for database matching are adjusted yet again, and 350 and 360 are repeated. This process is repeated, moving up the hierarchy of the database, until one or more groups of hits are returned that have hit counts above the threshold value.
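
The loop between 350 and 360 can be sketched as follows. The sketch assumes each relevant hit is labeled with its path through the classification hierarchy, from most general to most specific, so that moving up the hierarchy amounts to truncating the path by one level. The function name, the path representation, and the sample categories are illustrative assumptions rather than the described implementation.

```python
from collections import Counter


def group_until_threshold(hit_paths, threshold):
    """Group hits, moving up the hierarchy until a group exceeds the threshold.

    `hit_paths` holds one tuple per hit, ordered general to specific.
    Returns the hierarchy depth used and the groups whose hit count is
    above `threshold`, or (0, {}) if no level produces such a group.
    """
    if not hit_paths:
        return 0, {}
    depth = max(len(path) for path in hit_paths)
    while depth >= 1:
        groups = Counter(path[:depth] for path in hit_paths)       # step 350
        passing = {g: n for g, n in groups.items() if n > threshold}
        if passing:
            return depth, passing
        depth -= 1                                                  # step 360
    return 0, {}


# Hypothetical hits labeled with three-level hierarchy paths.
hits = (
    [("Food", "Dining", "Pizza")] * 2
    + [("Food", "Dining", "Sushi")] * 2
    + [("Lodging", "Hotels", "Resorts")]
)

print(group_until_threshold(hits, threshold=3))
```

With a threshold of 3, no group qualifies at the most specific level, so the grouping is repeated one level up, where the four dining hits fall into a single group.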

When one or more groups have hit counts above the threshold value, the groups of hits are displayed to the user at 370 in a manner that allows the user to visually distinguish between the groups of hits. The user is then able to select each group of hits and view the web pages linked to the hits within the group.

For ease of understanding, an example of a search query performed on a full text indexed local search engine that performs clustering in accordance with the present invention will now be described. Turning to FIG. 4, the graphical user interface of the full text indexed local search engine is shown and is generally identified by reference numeral 420. As can be seen, the search engine includes a text field 435 into which a search query is entered. In this example, the query “90210 Caesar salad” is entered into the text field 435 and the search is initiated. In response to the query, the search engine returns a set of indexed results that have the words ‘Caesar’ and ‘salad’ and are in the 90210 zip code. The returned search result hits are matched to the SIC codes in the criteria database to determine relevant and irrelevant hits. The irrelevant hits are discarded and the relevant hits are grouped according to SIC code. Each group is then displayed as a heading, with counts showing the number of hits in each group, as shown below and as identified by reference numeral 425 in FIG. 4:

-   Restaurants (12)
-   Eating and Drinking Establishments (6)
-   Hotels and Motels (2)
-   Retail Bakeries (1)
-   Membership Sports and Recreation Clubs (1)
-   Individual and Family Social Services (1)
-   Hotels, Rooming Houses, Camps and other Lodging Places (1)
-   Convenience Stores (1)
-   Bands, Orchestras, Actors, and other Entertainers and Entertainment Groups (1)
-   Automobile Parking (1)

In this particular example, the search engine also displays a list of all of the relevant hits, as identified by reference numeral 430. Selecting one of the headings results in the search engine presenting only the relevant hits in the associated group.

As can be seen, the restaurants group includes 12 hits, or 44% (12/27) of the total search result hits. Statistically, this is a strong indication that the query “Caesar salad” is most commonly associated with restaurants. Table 1 below shows the SIC codes associated with the group categories, as well as the hit count for each group. In this particular example, the set criteria used to group hits returned during the query is commonality of the first three digits of the SIC codes.

TABLE 1

Group Category                              SIC Code    Hit Count
Restaurants                                 58200000    12
Eating and Drinking Establishments          58100000    6
Hotel and Motel                             70110000    3
Retail bakeries                             54610000    1
Memberships-Sports and Recreation Clubs     79970000    1
Individual and Family social services       83200000    1
Convenience stores                          54110200    1
Bands, orchestras, actors                   79290109    1
Automobile parking                          75210000    1

In this example, if the search criteria fail to return one or more groups having a hit count above the threshold value, the set criteria used to group hits returned during the search query are moved up the hierarchy to two digit SIC codes, as shown in Table 2 below.

TABLE 2

Group Category                              SIC Code    Hit Count
Eating and drinking establishments          58000000    18
Hotel and Motel                             70000000    3
Food stores                                 54000000    2
Amusement and recreation services           79000000    2
Social services                             83000000    1
Automotive repair, services and parking     75000000    1
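
As a check on the regrouping, the following sketch sums the Table 1 hit counts by the first two digits of each SIC code and arrives at the hit counts listed in Table 2. The SIC codes and counts are taken from Table 1; the function name `regroup` and the zero-padding of the shortened code to eight digits are assumptions made for illustration.

```python
from collections import Counter

# SIC codes and hit counts as listed in Table 1.
TABLE_1 = {
    "58200000": 12,  # Restaurants
    "58100000": 6,   # Eating and Drinking Establishments
    "70110000": 3,   # Hotel and Motel
    "54610000": 1,   # Retail bakeries
    "79970000": 1,   # Memberships-Sports and Recreation Clubs
    "83200000": 1,   # Individual and Family social services
    "54110200": 1,   # Convenience stores
    "79290109": 1,   # Bands, orchestras, actors
    "75210000": 1,   # Automobile parking
}


def regroup(counts, digits):
    """Sum hit counts over SIC codes that share their first `digits` digits."""
    grouped = Counter()
    for code, hits in counts.items():
        parent = code[:digits].ljust(len(code), "0")  # e.g. "58" -> "58000000"
        grouped[parent] += hits
    return dict(grouped)


two_digit = regroup(TABLE_1, digits=2)
for code, hits in sorted(two_digit.items(), key=lambda item: -item[1]):
    print(code, hits)
```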

The search criteria, if required, could be moved up the hierarchy even further to one digit SIC codes, as shown in Table 3 below.

TABLE 3

Group Category                              SIC Code    Hit Count
Eating and drinking establishments          58000000    18
Food Stores                                 54000000    2

If desired, rather than displaying headings representing each of the groups having hit counts greater than the threshold value, only the heading associated with the group having the highest hit count can be displayed. If this is done, the displayed group will, for the most part, yield the most relevant group of hits for most users. A link to the other groups having hit counts greater than the threshold value but lower than the highest hit count can also be displayed.

Alternatively, the groups of hits to be displayed can be based on a relevancy percentage of hits per group versus total search result hits. Thus, if the relevancy threshold is set, for example, at 15%, and a search resulted in a total of twenty-seven (27) hits, only groups having a hit count greater than 0.15×27=4.05 (rounded up or down) would be displayed.

For example, using the results of Table 1, the “Restaurants” and “Eating and Drinking Establishments” groups would meet the relevancy threshold. By using less specific clustering and moving up the hierarchy, the relevancy percentage can be increased. Using the results of Table 2, where the clustering criteria have become more general, the main group represents about 67% (18 of 27) of the search results.
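
The relevancy-percentage filter can be sketched as follows, using the group names and hit counts from Table 1. The function name `relevant_groups` is hypothetical, and the sketch compares each count against the unrounded cutoff; rounding the cutoff up or down, as the description permits, gives the same result for this data.

```python
def relevant_groups(group_counts, relevancy=0.15):
    """Keep groups whose hit count exceeds `relevancy` times the total hits."""
    total = sum(group_counts.values())
    cutoff = relevancy * total  # 0.15 x 27 = 4.05 for the Table 1 data
    return {group: hits for group, hits in group_counts.items() if hits > cutoff}


# Group names and hit counts as listed in Table 1.
table_1_counts = {
    "Restaurants": 12,
    "Eating and Drinking Establishments": 6,
    "Hotel and Motel": 3,
    "Retail bakeries": 1,
    "Memberships-Sports and Recreation Clubs": 1,
    "Individual and Family social services": 1,
    "Convenience stores": 1,
    "Bands, orchestras, actors": 1,
    "Automobile parking": 1,
}

kept = relevant_groups(table_1_counts, relevancy=0.15)
print(kept)
```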

As will be appreciated by those of skill in the art, adjusting the criteria either up or down the hierarchy will change the number of groups that satisfy the threshold value. The criteria and threshold values may be fixed or user adjustable.

FIGS. 5A and 5B show graphical user interfaces of full text search engines incorporating clustering to group search results. Referring to FIG. 5A, the graphical user interface 530 presents the group headings 535 below the search result hit list 430. Referring to FIG. 5B, the graphical user interface 550 presents the group headings 535 to the side of the search result hit list 430.

By organizing search results into groups of hits, visual displays or maps where the search results are displayed can be improved. In the above examples, without clustering, there is no mechanism for a user to see what listings may be similar. Clustering is also useful in a yellow pages implementation, where there is no full text index but the categories are known. In this case, clustering is used on the yellow pages headings. For example, if there are ten different yellow pages categories and a different color is assigned to each category, confusion arises because different shades of the same color must be used to differentiate between various groups. By using clustering, the irrelevant hits can be eliminated and/or hits can be grouped, resulting in fewer categories. As a result, fewer colors are required, making categorization visually more evident.

It will be apparent to those of ordinary skill in the art that methods involved in the present invention may be embodied in a computer program product. For example, the database 25 described in reference to FIG. 1 may be a collection of information stored on a persistent storage device of any computer usable medium. Such a computer usable medium can comprise a readable memory device, such as a hard drive device (e.g., PC tablet), a CD-ROM, a DVD-ROM, or a disk, having computer readable program code stored thereon. The computer readable medium can also include a communications medium, such as a bus or a communications link, either wired or wireless.

It will also be apparent to those of ordinary skill in the art that the website content extracted using the data extraction tool 30-4 is described as extracted textual content for purposes of illustration. Those skilled in the art will appreciate that the content may also be image based. If the content is an image, machine vision techniques can be used to determine descriptive attributes from the images, or from attributes associated with an image.

It should be understood that the optimal parameters/settings for the algorithms described herein, such as the content analyzer 40-4, clustering algorithms 40-5, etc., may be tuned using optimization techniques, such as genetic tuning, machine learning, neural networks, Bayesian networks, etc. It should also be understood that the content analyzer 40-4 may utilize support vector machines (SVM) to increase the relevancy of resulting indices.

While this invention has been particularly shown and described with references to certain embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

1. A computer implemented system for optimizing searching comprising: one or more computer processors executing a search query; a search engine responsive to the search query; a clustering engine, in communication with the search engine, processing search result hits resulting from the query, where the clustering engine clusters the hits using an SIC classification of business activities; and the clustering engine grouping the hits according to their respective SIC code using at least the first two digits of the SIC code to group hits and eliminate false positives.
2. A computer implemented system for optimizing searching as in claim 1 wherein the SIC predefined taxonomy of SIC business activities are not computer generated.
3. A computer implemented system as in claim 1 wherein each of the business activities in the SIC predefined taxonomy corresponds to a business descriptor.
4. A computer implemented system as in claim 1 wherein the search engine responds to the search query by conducting a full-text index search of website content; and the clustering engine grouping the hits according to their respective SIC code by clustering results that are responsive to the full-text index search, where the full-text index search results are clustered based on associated SIC codes.
5. A computer implemented system as in claim 4 wherein at least a portion of the full-text index search results are clustered by: matching SIC codes associated with the full-text indexed search results; and grouping the full-text index search results having matching SIC codes.
6. A computer implemented system in claim 5 wherein grouping the full-text index search results having matching SIC codes further includes: determining a group hit count for each group; determining if the group hit count is above a threshold value; displaying groups of hits that are above the threshold value, where each of the displayed groups of hits are displayed with a heading corresponding to their respective SIC code and with counts showing the number of hits in the group, where the SIC code heading displayed is an associated SIC business descriptor.
7. A computer implemented system in claim 6 further includes responding to a selection of one of the SIC code headings displayed by presenting hits in the group corresponding to the selected SIC code heading.
8. A computer implemented system in claim 5 wherein grouping full-text index search results having matching SIC codes further includes using the first three digits of the SIC codes to group the full-text index search results.
9. A computer implemented system in claim 8 wherein if the full-text index search results fail to include one or more groups of hits having a count above a threshold value, using the first two digits of the SIC codes to group hits.